Let’s revisit the Ames Housing data.
library(AmesHousing)
?ames_raw
Use the ames_raw data up to 2008 to predict the housing prices for the later years.
ames_raw_2008 <- ames_raw[ames_raw$`Yr Sold` < 2008, ]
ames_raw_2009 <- ames_raw[ames_raw$`Yr Sold` >= 2008, ]
Use the same loss function calculator.
calc_loss <- function(prediction, actual) {
  difpred <- actual - prediction
  RMSE <- sqrt(mean(difpred^2))
  # Overprediction costs the full overshoot; underprediction costs 10% of the actual price.
  operation_loss <- abs(sum(difpred[difpred < 0])) + sum(0.1 * actual[difpred > 0])
  return(list(RMSE = RMSE, operation_loss = operation_loss))
}
Use the nonlinear methods discussed in Ch. 7 of the book. Can you make a better prediction?
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Let’s revisit COVID data. I’ve divided the data into training and testing data.
Train_COVID<-readRDS("Train_COVID.rds")
Test_COVID<- readRDS( "Test_COVID.rds")
Try the methods described in Ch. 7 and see if you can improve the performance.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
This question relates to the College data set.
data(College,package = "ISLR2")
Your code:
#
training_vec <- sample(1:777, 543, replace = FALSE)
training_data <- College[training_vec, ]
test_data <- College[-training_vec, ]
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Fit some of the non-linear models investigated in chapter 7 to the
Auto data set. Is there evidence for non-linear
relationships in this data set? Create some informative plots to justify
your answer.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
GAMs are generally fit using a backfitting approach. The idea behind backfitting is quite simple. We will now explore backfitting in the context of multiple linear regression. Suppose we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value and update only that coefficient estimate using a simple linear regression. The process is continued until convergence, that is, until the coefficient estimates stop changing. We now try this out on a toy example.
a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]
a <- y - beta2 * x2
beta1 <- lm(a ~ x1)$coef[2]
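Wrapping the two updates above in a loop gives a complete toy run. A minimal sketch with simulated data (the true coefficients 2, 1.5, and -0.8 are made up for illustration):

```r
# Backfitting for two predictors using only simple linear regressions.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

beta1 <- 0                           # arbitrary starting value
for (iter in 1:20) {
  a     <- y - beta1 * x1            # partial residual with x1 effect removed
  beta2 <- lm(a ~ x2)$coef[2]
  a     <- y - beta2 * x2            # partial residual with x2 effect removed
  beta1 <- lm(a ~ x1)$coef[2]
}
beta0 <- mean(y - beta1 * x1 - beta2 * x2)

# The backfitted estimates agree with the multiple regression fit:
rbind(backfit = c(beta0, beta1, beta2),
      lm      = coef(lm(y ~ x1 + x2)))
```

With nearly uncorrelated predictors the loop converges in a handful of iterations; the more correlated x1 and x2 are, the slower the convergence.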
Your code:
#
#
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response. (a) Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Starting with one predictor, we want to estimate a function \(f\) that describes the outcome of interest \(y\) such that \[y=f(x)+\epsilon\] where \(\epsilon\) is some error.
In linear regression, we assumed the functional form to be a line \[f(x)=\alpha+\beta x\] where \(\alpha\) is the intercept and \(\beta\) is the slope. This simplification allowed us to summarize the relationship between the two variables using just one number \(\beta\). In reality, this is a gross simplification, and in particular, when prediction accuracy is more important than the description of the trend in the relationship, we may want to use more flexible methods.
For example, what do you do if your data looks like the following?
Clearly, a linear relationship does not hold globally. Even in such cases, we can look at local relationships and then sew them together to get a global picture.
The simplest way to estimate \(f\) at \(x_i\) locally is by averaging the \(y\)s corresponding to the \(x\)s near \(x_i\): \[\hat{f}(x_i)=\frac{1}{n_i}\sum_{j\in N(x_i)} y_j\] where \(N(x_i)\) indexes the \(n_i\) neighbors of \(x_i\).
\(N(x_i)\) can be defined as you wish, but the most popular choice is to use a symmetric neighborhood consisting of the nearest \(2k + 1\) points:
\[N(x_i) = \{\max(i-k,\,1), \dots, i-1,\; i,\; i+1, \dots, \min(i+k,\,n)\}.\]
For example, suppose we use the 10 closest points, so that the neighborhood looks like this:
If we keep repeating the procedure for each point and connect the dots, we get the following.
It looks surprisingly good on the left, but it is clearly too wiggly on the right. The ends are usually poor, because points near the boundary have neighbors on only one side.
If we increase the neighborhood size to 20:
Now the left side is too smooth, while the right side looks more reasonable.
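The running mean is only a few lines of R. A minimal sketch on simulated data (the sine curve and sample size are made up; k is the neighborhood half-width, so each window holds at most 2k + 1 points):

```r
# Running-mean smoother over a symmetric index neighborhood.
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

running_mean <- function(y, k) {
  n <- length(y)
  sapply(seq_len(n), function(i) {
    idx <- max(i - k, 1):min(i + k, n)   # N(x_i), truncated at the ends
    mean(y[idx])
  })
}

fhat10 <- running_mean(y, k = 10)   # smaller neighborhood: wigglier
fhat20 <- running_mean(y, k = 20)   # larger neighborhood: smoother
plot(x, y, col = "gray")
lines(x, fhat10, col = "red")
lines(x, fhat20, col = "blue")
```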
Instead of using a simple mean, we can fit a regression at each neighborhood. Then make a prediction at each \(x_i\) so that \[\hat{f} ( x_i ) = \hat{\alpha}_i + \hat{\beta}_i x_i ,\] where \(\hat{\alpha}_i\) and \(\hat{\beta}_i\) are OLS estimates based on points in a neighborhood \(N(x_i)\) of \(x_i\). This is actually easy to do thanks to well-known regression updating formulas. Extension to weighted data is obvious. It’s much better than running means.
If we were to increase the neighborhood size to 20:
As you can see from the figures, some kernels are local: beyond a certain distance the weight is exactly zero, so faraway points do not matter. The Gaussian kernel, by contrast, is a global kernel, because every point contributes.
Using the Epanechnikov kernel with \(\lambda = 0.05\) and \(\lambda = 0.2\):
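A kernel smoother with the Epanechnikov kernel can be sketched directly; here \(\lambda\) is the half-width of the kernel support, and the data are simulated for illustration:

```r
# Nadaraya-Watson kernel smoother with the Epanechnikov kernel.
set.seed(1)
x <- sort(runif(200, 0, 1))
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

epanechnikov <- function(u) ifelse(abs(u) < 1, 0.75 * (1 - u^2), 0)

ksmooth_epa <- function(x, y, lambda, xout = x) {
  sapply(xout, function(x0) {
    # Weights are exactly zero beyond lambda: a local kernel.
    w <- epanechnikov((x - x0) / lambda)
    sum(w * y) / sum(w)
  })
}

plot(x, y, col = "gray")
lines(x, ksmooth_epa(x, y, lambda = 0.05), col = "red")
lines(x, ksmooth_epa(x, y, lambda = 0.2), col = "blue")
```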
loess is the smoother that ggplot2’s geom_smooth uses by default for smaller data sets. A variant uses robust regression in each neighborhood.
With its default settings, loess did not work well on the left example; to reproduce the small bandwidth you need to set the span option.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : k-d tree limited by memory. ncmax= 300
Scatter plot smoothing with loess
require(graphics)
with(cars, scatter.smooth(speed, dist, lpars =
list(col = "red", lwd = 3, lty = 3)))
Kernel Regression Smoother
require(graphics)
with(cars, {
plot(speed, dist)
lines(ksmooth(speed, dist, "normal", bandwidth = 2), col = "red")
lines(ksmooth(speed, dist, "normal", bandwidth = 5), col = "blue")
legend("topright",legend=c("bandwidth = 2","bandwidth = 5"),col=c("red","blue"),lty=1,lwd=2,cex=.8)
})
2D Kernel density estimate
n <- 10000
x1 <- matrix(rnorm(n), ncol = 2)
x2 <- matrix(rnorm(n, mean = 3, sd = 1.5), ncol = 2)
x <- rbind(x1, x2)
oldpar <- par(mfrow = c(2, 2), mar=.1+c(3,3,1,1), mgp = c(1.5, 0.5, 0))
smoothScatter(x, nrpoints = 0)
Splines are used for interpolation and smoothing.
On the left, we see 9 points observed across the span of \(x\), and the goal is to fit a line through these points, not necessarily to find a linear trend. On the right is a case where we see a bend in the data, and the idea is to find a reasonable curve representing this data cloud.
The natural starting point is to fit straight lines and glue them together. We can do so by fitting a regression to each subset of the data.
The problem, as apparent from the figure above, is that the result is not necessarily smooth or even continuous where the pieces meet; these meeting points are called knots. Splines are a popular method for achieving continuity at the knots.
A spline is a piece-wise polynomial with pieces defined by a sequence of knots \[\eta_1 <\eta_2<\cdots<\eta_k\] such that the pieces join smoothly at the knots.
The simplest case is a linear spline with one knot \(\eta_1\): \[S(x)=\beta_0+\beta_1 x +\gamma (x-\eta_1)_{+}\] The term \((x-\eta_1)_{+}\) is 0 until \(x\) is larger than \(\eta_1\).
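The truncated term is easy to use in practice: construct \((x-\eta_1)_{+}\) with pmax and hand it to lm. A sketch with simulated data and a hypothetical knot at \(\eta_1 = 5\):

```r
# Linear spline with one knot, fit by OLS on x and the hinge term (x - eta1)_+.
set.seed(1)
x <- runif(200, 0, 10)
eta1 <- 5
y <- 1 + 0.5 * x + 2 * pmax(x - eta1, 0) + rnorm(200, sd = 0.3)

hinge <- pmax(x - eta1, 0)   # zero until x exceeds the knot
fit <- lm(y ~ x + hinge)
coef(fit)                    # estimates of beta0, beta1, gamma

xx <- seq(0, 10, by = 0.1)
plot(x, y, col = "gray")
lines(xx, predict(fit, newdata = data.frame(x = xx, hinge = pmax(xx - eta1, 0))),
      col = "red", lwd = 2)
```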
For a spline of degree \(m\) one usually requires the polynomials and their first \(m - 1\) derivatives to agree at the knots, so that \(m - 1\) derivatives are continuous. A spline of degree \(m\) can be represented as a power series:
\[S(x)=\sum^m_{j=0}\beta_j x^j+\sum^k_{j=1}\gamma_j(x-\eta_j)^{m}_{+}\]
The most popular splines are cubic splines: \[S(x)= \beta_0 +\beta_1x+\beta_2x^2+\beta_3x^3+\sum^k_{j=1}\gamma_j(x-\eta_j)^3_{+}\]
Suppose we know the values of a function at \(k\) points \(x_1 < \dots < x_k\) and would like to interpolate for other \(x\)’s. If we used a spline of degree \(m\) with knots at the observed \(x\)’s, we would have \(m + 1 + k\) parameters to estimate with only \(k\) observations. Obviously, we need some restrictions.
gridExtra::grid.arrange(
ggplot(women)+aes(x=height,y=weight)+geom_point(),
ggplot(data.frame(xi,yi))+aes(x=xi,y=yi)+geom_point()
,ncol=2)
A spline of odd degree \(m = 2\nu-1\) is called a natural spline if it is a polynomial of degree \(\nu - 1\) outside the range of the knots (i.e. below \(\eta_1\) or above \(\eta_k\)). A natural cubic spline (\(\nu=2\)) is linear outside the range of the data. For a natural spline, \[\begin{eqnarray*} \beta_j &=& 0 \mbox{ for } j=\nu,\dots,2\nu-1 \\ \sum^k_{i=1}\gamma_i\eta_i^j &=&0 \mbox{ for } j=0,1,\dots,\nu-1. \end{eqnarray*}\]
This imposes exactly \(m + 1\) restrictions, so we have \(k\) parameters left. Note that a natural cubic spline has the form
\[S(x)=\beta_0 +\beta_1 x+ \sum^k_{j=1}\gamma_j(x-\eta_j)^3_{+},\] subject to the restrictions \(\sum_j \gamma_j =0\) and \(\sum_j \gamma_j\eta_j =0\), so we end up with \(k\) parameters.
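The defining property, linearity outside the boundary knots, is easy to verify numerically with ns() from the splines package (the knot locations below are arbitrary): the second differences of every basis column vanish beyond the boundaries.

```r
# A natural cubic spline is linear outside its boundary knots:
# second differences of each ns() basis column are zero out there.
library(splines)
x <- seq(0, 10, by = 0.01)
B <- ns(x, knots = c(3, 5, 7), Boundary.knots = c(2, 8))

curv_left  <- max(abs(apply(B[x < 2, ], 2, diff, differences = 2)))
curv_right <- max(abs(apply(B[x > 8, ], 2, diff, differences = 2)))
curv_all   <- max(abs(apply(B, 2, diff, differences = 2)))
c(left = curv_left, right = curv_right, overall = curv_all)
# left and right are numerically zero; overall is not.
```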
library(splines)
require(graphics); require(stats)
ispl <- interpSpline( women$height, women$weight ,bSpline=TRUE)
ispl2 <- interpSpline( weight ~ height, women ,bSpline=TRUE)
# ispl and ispl2 should be the same
par(mfrow = c(1,2), mgp = c(2,.8,0), mar = 0.1+c(3,3,3,1))
#par(mfrow=c(1,2))
plot( predict( ispl, seq( 55, 75, length.out = 51 ) ), type = "l" )
points( women$height, women$weight )
#plot( ispl ) # plots over the range of the knots
#points( women$height, women$weight )
#splineKnots( ispl )
ispl2 <- interpSpline( yi ~ xi ,bSpline=TRUE)
plot(predict( ispl2, seq( -1, 10, length.out = 51 ) ), type = "l")
points(xi,yi)
Consider now the problem of smoothing a scatterplot.
gridExtra::grid.arrange(
ggplot(dt)+aes(x=x,y=yg)+geom_point(),
ggplot(Auto)+aes(x=mpg,y=acceleration)+geom_point()
,ncol=2)
One approach is to select a suitable set of knots with \(k \ll n\) and then fit the spline by OLS (or WLS, or maximum likelihood).
For a cubic spline, this amounts to regressing \(y\) on \(k + 4\) predictors, namely \(1, x, x^2, x^3, (x-\eta_1)^3_{+}, (x-\eta_2)^3_{+}, \dots, (x-\eta_k)^3_{+}\). For a natural cubic spline, we would drop \(x^2\) and \(x^3\) and impose the additional constraints \[\sum\gamma = \sum \gamma\eta = 0.\] Actually, these constraints can be eliminated by a suitable re-parametrization. For example, a natural cubic spline with two interior knots plus one knot at each extreme of the data can be fit by regressing \(y\) on three covariates, \(x\), \(z_1\), and \(z_2\), where \[z_1 = ( x - \eta_1 )^3_{+} -\frac{ \eta_1 - \eta_4 } {\eta_3 - \eta_4} ( x - \eta_3 )^3_{+}\] \[z_2 = ( x - \eta_2 )^3_{+} -\frac{ \eta_2 - \eta_4 } {\eta_3 - \eta_4}( x - \eta_3 )^3_{+} .\]
A much better representation of splines for computation is as linear combinations of a set of basis splines called B-splines. These are numerically more stable, among other reasons, because each B-spline is non-zero over a limited range of knots. They are not so easy to calculate, but fortunately, R and S have functions for calculating a basis, see bs for B-splines and ns for natural B-splines. Regression splines are very popular because they are easy to use, and can be incorporated without difficulty as part of other estimation procedures. The main problem is where to place the knots. Often, they are placed at selected percentiles. A smarter strategy would place more knots in regions where \(f(x)\) is changing more rapidly. Knot placement is an arcane art form and the first disadvantage cited by detractors of regression splines.
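The local support of B-splines is easy to see: each column of a bs() basis is nonzero only over a few adjacent knot spans (the knot locations below are arbitrary).

```r
# Each B-spline basis function is nonzero over a limited range of knots.
library(splines)
x <- seq(0, 1, length.out = 501)
B <- bs(x, knots = c(0.25, 0.5, 0.75))   # cubic basis with 3 interior knots

# Fraction of x values at which each basis column is exactly zero:
round(colMeans(B == 0), 2)

matplot(x, B, type = "l", lty = 1, ylab = "basis value")
```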
gridExtra::grid.arrange(
ggplot(dt)+aes(x=x,y=yg)+geom_point(),
ggplot(Auto)+aes(x=mpg,y=acceleration)+geom_point()
,ncol=2)
xxmpg <- seq(9, 46.6, by = 0.1)
predac <- predict(lm(acceleration ~ bs(mpg, df = 5), data = Auto),
                  newdata = data.frame(mpg = xxmpg))
plot(Auto$mpg, Auto$acceleration)
lines(xxmpg, predac, col = "red")
A more formal approach is to fit a spline with knots at every data point, which could fit the data perfectly, but to estimate its parameters by minimizing the usual sum of squares plus a roughness penalty. A suitable penalty is the integral of the squared second derivative, leading to the following criterion, known as the penalized sum of squares:
\[PSS = \sum_{i=1}^{n}(y_i-S(x_i))^2 + \lambda\int \left(S''(x)\right)^2\,dx\]
where integration is over the range of \(x\), and \(\lambda\) is a tuning parameter. As \(\lambda\rightarrow 0\) we impose no penalty and end up with a very close fit, but the resulting curve could be very noisy, as it follows every detail in the data. As \(\lambda\rightarrow \infty\) the penalty dominates and the solution converges to the OLS line, which is as smooth as you can get (the second derivative is always 0) but may be a very poor fit. Amazingly, it can be shown that minimizing the PSS for a fixed \(\lambda\) over the space of all continuous, twice-differentiable functions has a unique solution, and this solution is a natural cubic spline with knots at the data points. More generally, penalizing the squared \(\nu\)-th derivative leads to a natural spline of degree \(2\nu-1\). For a proof, see Reinsch (1967).
# A wrapper so that geom_smooth() can use smooth.spline() as its method
smooth.spline2 <- function(formula, data, ...) {
  mat <- model.frame(formula, data)
  smooth.spline(mat[, 2], mat[, 1])
}
# Method for ggplot2's internal predictdf() generic, called by geom_smooth()
predictdf.smooth.spline <- function(model, xseq, se, level) {
  pred <- predict(model, xseq)
  data.frame(x = xseq, y = pred$y)
}
qplot(mpg, wt, data = mtcars) + geom_smooth(method = "smooth.spline2", se= F)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
qplot(x, yg, data = dt) + geom_smooth(method = "smooth.spline2", se= F)
## `geom_smooth()` using formula = 'y ~ x'
#### Cross-Validation

We have solved the problem of knot placement, but now we have to pick an appropriate value for \(\lambda\). Some claim this is easier because we are left with a single number to worry about. Wahba and others have suggested a technique known as cross-validation. Let \(\hat{S}^{(-i)}_{\lambda}\) denote the spline fit with tuning parameter \(\lambda\) computed while omitting the \(i\)-th observation. We can compare this fit with the observed value \(y_i\), and we can summarize these differences by computing a sum of squares \[CVSS(\lambda) = \sum_{i=1}^{n} (y_i-\hat{S}^{(-i)}_{\lambda} (x_i))^2,\] which depends on \(\lambda\). The idea is to pick \(\lambda\) to minimize \(CVSS(\lambda)\). This sounds like a lot of work, but it isn’t, thanks again to regression updating formulas, which can be used to show that \[CVSS(\lambda) = \sum_{i=1}^{n} \left(\frac{y_i - \hat{S}_{\lambda} (x_i)}{1 - A_{ii}}\right)^2,\]
where \(\hat{S}_{\lambda}\) is the fit using all the data and \(A_{ii}\) is a diagonal element of the smoother matrix \(A = (I + \lambda K)^{-1}\). This extends easily to WLS. An alternative criterion replaces each \(A_{ii}\) with their average, \(\mathrm{tr}(A)/n\). This leads to a generalized CVSS (GCV) that has been found to work well in practice.
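This is exactly what smooth.spline() does in R: cv = TRUE selects \(\lambda\) by ordinary leave-one-out cross-validation, while the default cv = FALSE uses the generalized (GCV) criterion. A quick sketch on simulated data:

```r
# Choosing the smoothing parameter by cross-validation with smooth.spline().
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_cv  <- smooth.spline(x, y, cv = TRUE)    # minimizes CVSS(lambda)
fit_gcv <- smooth.spline(x, y, cv = FALSE)   # minimizes the generalized CVSS

# The two criteria typically pick similar (not identical) amounts of smoothing.
c(cv = fit_cv$lambda, gcv = fit_gcv$lambda)

plot(x, y, col = "gray")
lines(predict(fit_cv), col = "red")
lines(predict(fit_gcv), col = "blue", lty = 2)
```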
Basis functions can seem mysterious, but it’s pretty neat once you see what bs and ns are doing.
fit1=lm(wage~bs(age,degree=1,knots=c(25,40,60)),data=Wage)
bs.weight1<-fit1$coefficients[-1]/sum(fit1$coefficients[-1])
bs.age1<-with(Wage,bs(age,degree=1,knots=c(25,40,60)))
xage=seq(18,80,by=1)
pred.bs.age1<- predict(bs.age1,newx=xage)
par(mfrow=c(1,3))
plot(range(xage),c(0,1),type="n")
for(i in 1:4) lines(xage,pred.bs.age1[,i])
plot(range(xage),c(0,1),type="n")
for(i in 1:4) lines(xage,bs.weight1[i]*pred.bs.age1[,i])
plot(Wage$age,Wage$wage,col="gray")
lines(xage,fit1$coefficients[1]+pred.bs.age1%*%fit1$coefficients[-1],type="l")
fit2=lm(wage~bs(age,degree=2,knots=c(25,40,60)),data=Wage)
bs.weight2<-fit2$coefficients[-1]/sum(fit2$coefficients[-1])
bs.age2<-with(Wage,bs(age,degree=2,knots=c(25,40,60)))
xage=seq(18,80,by=1)
pred.bs.age2<- predict(bs.age2,newx=xage)
par(mfrow=c(1,3))
plot(range(xage),c(0,1),type="n")
for(i in 1:5) lines(xage,pred.bs.age2[,i])
plot(range(xage),c(0,1),type="n")
for(i in 1:5) lines(xage,bs.weight2[i]*pred.bs.age2[,i])
plot(Wage$age,Wage$wage,col="gray")
lines(xage,fit2$coefficients[1]+pred.bs.age2%*%fit2$coefficients[-1],type="l")
fit3=lm(wage~bs(age,knots=c(25,40,60)),data=Wage)
bs.weight3<-fit3$coefficients[-1]/sum(fit3$coefficients[-1])
bs.age<-with(Wage,bs(age,knots=c(25,40,60)))
xage=seq(18,80,by=1)
pred.bs.age<- predict(bs.age,newx=xage)
par(mfrow=c(1,3))
plot(range(xage),c(0,1),type="n")
for(i in 1:6) lines(xage,pred.bs.age[,i])
plot(range(xage),c(0,1),type="n")
for(i in 1:6) lines(xage,bs.weight3[i]*pred.bs.age[,i])
plot(Wage$age,Wage$wage,col="gray")
lines(xage,fit3$coefficients[1]+pred.bs.age%*%fit3$coefficients[-1],type="l")
If you want to draw confidence and prediction intervals, using the R predict function is the easiest thing to do.
# create random data
set.seed(12345)
x <- c(1:100)
y <- sin(pi * x / 50) + rnorm(100, 0, 0.4)
knots <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
# Fit a natural spline
myFit <- lm(y ~ ns(x, knots = knots))
# Plot the result
plot(x,y)
lines(x,predict(myFit))
# Confidence interval
lines(x,predict(myFit,interval="confidence")[,"upr"],lty=2)
lines(x,predict(myFit,interval="confidence")[,"lwr"],lty=2)
# Prediction interval
lines(x,predict(myFit,interval="prediction")[,"upr"],lty=3)
## Warning in predict.lm(myFit, interval = "prediction"): predictions on current data refer to _future_ responses
lines(x,predict(myFit,interval="prediction")[,"lwr"],lty=3)
## Warning in predict.lm(myFit, interval = "prediction"): predictions on current data refer to _future_ responses
# How the point-wise standard error is created
# One way is to grab the model matrix from the fitted object
X <- model.matrix(myFit)
sigma <- summary(myFit)$sigma
var.Yhat <- (diag(X %*% solve(t(X) %*% X) %*% t(X))) * sigma^2
mean(predict(myFit,se.fit = TRUE)$se.fit-sqrt(var.Yhat) )
## [1] -1.179612e-16
# Another option is to rebuild the design matrix with the ns function.
# The boundary knots must match those used in the original fit (range(x)),
# and for the confidence (se.fit) standard error no residual variance is added.
X.new <- cbind(1, ns(c(50:150), knots = knots, Boundary.knots = range(x)))
var.Yhat <- diag(X.new %*% solve(t(X) %*% X) %*% t(X.new)) * sigma^2
mean(predict(myFit, newdata = data.frame(x = c(50:150)), se.fit = TRUE)$se.fit - sqrt(var.Yhat))
There are two popular packages for fitting GAMs in R: gam and mgcv. The authors of the book like gam, but mgcv is better at some things because it takes a Bayesian approach. A key difference is that gam uses smoothing splines whereas mgcv uses P-splines. Also, the uncertainty intervals from mgcv are Bayesian, which tend to be better calibrated.
84 children at the Toronto Hospital for Sick Children underwent laminectomy, a corrective spinal surgery, for a variety of abnormalities under the general heading kyphosis. Results: 65 successes, 19 with kyphosis still present. Goal: try to understand/predict whether the operation will be successful.
data(kyphosis, package = "gam")
detach("package:gam", unload=TRUE)
library(mgcv)
layout(matrix(1:3, nrow = 1))
spineplot(Kyphosis ~ Age, data = kyphosis,
ylevels = c("present", "absent"))
spineplot(Kyphosis ~ Number, data = kyphosis,
ylevels = c("present", "absent"))
spineplot(Kyphosis ~ Start, data = kyphosis,
ylevels = c("present", "absent"))
kyphosis_gam <- gam(Kyphosis ~ s(Age, bs = "cr") +
s(Number, bs = "cr", k = 3) + s(Start, bs = "cr", k = 3),
family = binomial, data = kyphosis)
trans <- function(x)
binomial()$linkinv(x)
layout(matrix(1:3, nrow = 1))
plot(kyphosis_gam, select = 1, shade = TRUE, trans = trans )
plot(kyphosis_gam, select = 2, shade = TRUE, trans = trans )
plot(kyphosis_gam, select = 3, shade = TRUE, trans = trans )
Air pollution data of 41 US cities. The annual mean concentration of sulfur dioxide, in micrograms per cubic meter, is a measure of the city’s air pollution. The question of interest here is what aspects of climate and human ecology, as measured by the other six variables in the data, determine pollution?
library(mgcv)
data(USairpollution, package = "HSAUR2")
USairpollution_gam <- gam(SO2 ~ s(wind, bs = "cr") +
s(log(temp), bs = "cr", k = 3) + s(log(precip), bs = "cr", k = 3)+ s(log(popul), bs = "cr", k = 3)
+ s(log(manu), bs = "cr", k = 3)+ s(log(predays), bs = "cr", k = 3),data = USairpollution)
SO2hat <- predict(USairpollution_gam)
SO2 <- USairpollution$SO2
plot(SO2hat, SO2 - SO2hat, type = "n",
xlim = c(-30, 110), ylim = c(-30, 60))
library(wordcloud)   # provides textplot()
textplot(SO2hat, SO2 - SO2hat, rownames(USairpollution),
    show.lines = FALSE, new = FALSE)
abline(h = 0, lty = 2, col = "grey")
layout(matrix(1:6, nrow = 1))
plot(USairpollution_gam, select = 1, shade = TRUE )
plot(USairpollution_gam, select = 2, shade = TRUE )
plot(USairpollution_gam, select = 3, shade = TRUE )
plot(USairpollution_gam, select = 4, shade = TRUE )
plot(USairpollution_gam, select = 5, shade = TRUE )
plot(USairpollution_gam, select = 6, shade = TRUE )